Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

Neural Information Processing Systems

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O} ( L|X|\sqrt{|A|T} / \beta)$, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. When this assumption is removed, we give a regret bound of $\tilde{O} ( L^{3/2} |X| |A|^{1/4} T^{3/4})$, which holds for an arbitrary transition function. To our knowledge, these are the first algorithms in our setting that handle both bandit feedback and an unknown transition function.
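To make the two rates concrete, the following sketch evaluates only the leading terms of the two bounds stated above, with all constants and logarithmic factors dropped (an assumption, since $\tilde{O}$ hides them). The function names and the example numbers are hypothetical and chosen for illustration. Setting the two leading terms equal gives $T = |A| / (\beta^4 L^2)$; for larger $T$, the bound that relies on the reachability assumption is the smaller of the two.

```python
import math

def bound_reachable(L, X, A, T, beta):
    """Leading term L|X|sqrt(|A|T)/beta (constants and log factors dropped)."""
    return L * X * math.sqrt(A * T) / beta

def bound_general(L, X, A, T):
    """Leading term L^{3/2}|X||A|^{1/4}T^{3/4} (constants and log factors dropped)."""
    return L ** 1.5 * X * A ** 0.25 * T ** 0.75

def crossover_T(L, A, beta):
    """Number of episodes at which the two leading terms coincide."""
    return A / (beta ** 4 * L ** 2)

# Hypothetical problem sizes, for illustration only.
L, X, A, beta = 10, 100, 4, 0.1
T_star = crossover_T(L, A, beta)

for T in (T_star / 10, T_star, T_star * 10):
    b1 = bound_reachable(L, X, A, T, beta)
    b2 = bound_general(L, X, A, T)
    # Dividing by T gives average regret per episode, which shrinks as T grows.
    print(f"T={T:>8.0f}  reachable={b1:.3e}  general={b2:.3e}  "
          f"per-episode={b1 / T:.3e}, {b2 / T:.3e}")
```

Both leading terms grow sublinearly in $T$, so the per-episode values decrease as $T$ grows, which is what "no-regret" refers to in the abstract.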



Reviews: Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

The submission studies adversarial online learning in episodic loop-free Markov decision processes. Its significance is that it is the first work to analyze an adversarial online learning problem in which the transition function is unknown, the loss functions change between episodes, and only bandit feedback is observed. The related work section clearly describes how this line of research has progressed from settings with a fixed unknown transition function and an unknown loss function to the setting studied in this submission. Although the MDPs considered in the submission are L-layered and loop-free, the results and the analysis pave the way for general MDPs. The main idea is to design confidence sets that contain the optimal occupancy measure, which in turn induces the optimal policy.
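As background for the last point, the sketch below illustrates the standard notion of an occupancy measure in a layered loop-free MDP and how a policy is recovered from it. It is not the submission's algorithm and does not build any confidence set; the toy MDP, the state and action labels, and the function names are all hypothetical. Here $q(x, a, x')$ is taken to be the probability of visiting $x$, choosing $a$, and moving to $x'$, and the induced policy is $\pi(a \mid x) = \sum_{x'} q(x, a, x') / \sum_{a', x'} q(x, a', x')$.

```python
from collections import defaultdict

def occupancy_measure(layers, P, pi):
    """Forward pass over a layered loop-free MDP.

    layers: list of layers, each a list of states; layers[0] holds the start state.
    P[(x, a)]: dict mapping next state -> transition probability.
    pi[x]: dict mapping action -> probability of choosing it in x.
    Returns q[(x, a, x_next)] = Pr[visit x, take a, move to x_next].
    """
    reach = defaultdict(float)
    reach[layers[0][0]] = 1.0
    q = {}
    for layer in layers[:-1]:  # the last layer has no outgoing transitions
        for x in layer:
            for a, p_action in pi[x].items():
                for x_next, p_trans in P[(x, a)].items():
                    q[(x, a, x_next)] = reach[x] * p_action * p_trans
                    reach[x_next] += q[(x, a, x_next)]
    return q

def induced_policy(q):
    """Recover pi(a | x) from q by normalising over each visited state."""
    state_mass = defaultdict(float)
    pair_mass = defaultdict(float)
    for (x, a, _), value in q.items():
        state_mass[x] += value
        pair_mass[(x, a)] += value
    return {(x, a): v / state_mass[x]
            for (x, a), v in pair_mass.items() if state_mass[x] > 0}

# Hypothetical three-layer MDP: start state "s", middle states "u" and "v", goal "g".
layers = [["s"], ["u", "v"], ["g"]]
P = {
    ("s", 0): {"u": 0.8, "v": 0.2},
    ("s", 1): {"u": 0.3, "v": 0.7},
    ("u", 0): {"g": 1.0},
    ("v", 0): {"g": 1.0},
    ("v", 1): {"g": 1.0},
}
pi = {"s": {0: 0.5, 1: 0.5}, "u": {0: 1.0}, "v": {0: 0.25, 1: 0.75}}

q = occupancy_measure(layers, P, pi)
recovered = induced_policy(q)
```

In this toy example the entries of `q` within each layer sum to one, and `recovered` reproduces `pi` on every state that is reached with positive probability, which is the sense in which an occupancy measure induces a policy.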

